Okay, but what else can we do? Obviously what we want is to get rid of the summation; that is our problem, these sums are just too long. If we only want to make predictions with respect to the most probable hypothesis, we can do something else: instead of summing over all hypotheses to get the full picture, we look at the most probable hypothesis, the MAP hypothesis, and use it directly, hoping that this still gives us good values. We have to lose somewhere, and here we are gaining big. So we turn the whole thing into an optimization problem, and we only have to find this most probable hypothesis by a maximization argument: we maximize over all hypotheses the posterior P(h|d), which by Bayes' rule is proportional to P(d|h)P(h). We can do that either by maximizing this product directly or, even better, by maximizing its logarithm. Why is that good? Well, we still have a multiplication in there, and we like addition much better than multiplication; if we wrap a logarithm around the objective, we only have to add. This does not destroy anything: when we maximize, we can wrap any strictly monotonic function around the objective, since that is just a rescaling and leaves the maximizer unchanged. What we cannot do is wrap a non-monotonic function around it. So what we are going to do, and that is the standard trick, is to go to log maximization.
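As a minimal sketch of this idea (the hypothesis names, priors, and lime fractions below are illustrative assumptions, not numbers taken from the slides), the MAP hypothesis for the candy example can be picked out in log space like this:

import math

# Sketch only: candy-bag hypotheses with illustrative priors P(h) and
# lime fractions P(lime | h); these numbers are not taken from the slides.
priors = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
lime_frac = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}

def log_posterior(h, data):
    """log P(h) + sum_j log P(d_j | h), i.e. the log of P(h) * P(data | h)."""
    lp = math.log(priors[h])
    for d in data:
        p = lime_frac[h] if d == "lime" else 1.0 - lime_frac[h]
        if p == 0.0:            # this hypothesis rules the observation out
            return float("-inf")
        lp += math.log(p)
    return lp

def map_hypothesis(data):
    """h_MAP = argmax_h P(h | data); taking the log is a monotonic rescaling."""
    return max(priors, key=lambda h: log_posterior(h, data))

print(map_hypothesis(["lime", "lime", "lime"]))  # -> 'h5' for these numbers

The argmax is the same whether we score hypotheses by the posterior or by its logarithm; the log version only replaces the product of likelihoods by a sum.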
The observation is that, with enough samples, this MAP learning turns out to be approximately Bayesian. If we go back to the earlier picture: if the hypothesis space does capture what is actually going on, then the posterior converges on a most likely hypothesis, while all the others eventually die down. So Bayesian learning itself predicts that after a while its prediction is dominated by the best hypothesis, and approximating with the single most probable hypothesis is then actually a good idea.
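Here is a small simulation sketch of that convergence claim, again with illustrative priors and lime fractions rather than the lecture's exact numbers: as more candies are observed, the posterior piles up on one hypothesis, and the prediction made from h_MAP alone approaches the full Bayesian prediction.

import random

priors    = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h_i), illustrative
lime_frac = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | h_i), illustrative

random.seed(0)
# Illustrative data stream: candies drawn with a 75% lime proportion.
data = ["lime" if random.random() < 0.75 else "cherry" for _ in range(100)]

posterior = priors[:]
for n, d in enumerate(data, start=1):
    # Bayesian update: multiply by the likelihood of this candy, renormalize.
    likes = [f if d == "lime" else 1.0 - f for f in lime_frac]
    posterior = [p * l for p, l in zip(posterior, likes)]
    z = sum(posterior)
    posterior = [p / z for p in posterior]

    # Full Bayesian prediction: sum_i P(h_i | data) * P(lime | h_i)
    bayes_pred = sum(p * f for p, f in zip(posterior, lime_frac))
    # MAP prediction: use only the single most probable hypothesis
    map_pred = lime_frac[max(range(len(posterior)), key=lambda i: posterior[i])]
    if n in (1, 10, 100):
        print(n, round(bayes_pred, 3), round(map_pred, 3))

Early on the two predictions can differ noticeably; as the posterior concentrates on one hypothesis, they come together.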
If the data comes from a deterministic hypothesis, then we have a deterministic situation, as we very often do in science: if you are trying to predict or learn the current from the voltage and the resistance, then there is a deterministic, non-statistical relationship between these quantities. In the limit, the MAP hypothesis becomes that actual deterministic hypothesis. So we are in the best of both worlds: we approximate full Bayesian learning, and in deterministic cases we actually find the right hypothesis. And for computer scientists, of course, getting rid of these huge sums in favor of an optimization problem is something we like. Optimization is comparatively easy: you just do gradient descent or something like that, or, if you have a symbolic representation, you can work with partial derivatives and so on.
There is a wrinkle here which I would like to come back to, just to make you aware that there is a connection. You remember that when we were combating overfitting a while back, we had the idea of putting a size term into the optimization problem, so that during optimization we would also bias our search towards simple, small solutions. That is essentially an implementation of Occam's razor, and it was called regularization. In regularization we had a special case where we basically used the information content of a hypothesis. Typically we needed a proportionality factor to align the scale of optimizing for a solution, for a maximum, with the scale of the regularization term. The idea was that if we could express both of them on the same scale, then we would not need this factor, which we did not know how to pick anyway.
That led to the idea of minimum description length: if we look at the information content of the hypothesis and of the data given the hypothesis, and turn both of them into Turing machine programs, then this is theoretically beautiful, and somewhat surprisingly it also works in practice. Since we are taking logarithms here anyway, we are not far off from minimum description length. There is a variant of learning which uses exactly the same ideas as regularization did; it is called minimum description length learning. In our example with the candies it predicts exactly the right thing. We are minimizing the information content, the description length, of the hypothesis together with the data encoded given the hypothesis. The idea is the same as always: smaller, simpler hypotheses are preferred.
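A hedged sketch of that MDL view, reusing the same illustrative candy numbers as above: the description length of a hypothesis plus the data given it is just the negated log posterior measured in bits, so minimizing description length picks the same hypothesis as MAP, and no hand-tuned proportionality factor is needed, since both terms are already on the same scale.

import math

priors    = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
lime_frac = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}

def description_length(h, data):
    """L(h) + L(data | h) = -log2 P(h) - log2 P(data | h), in bits."""
    bits = -math.log2(priors[h])                 # bits to encode the hypothesis
    for d in data:
        p = lime_frac[h] if d == "lime" else 1.0 - lime_frac[h]
        if p == 0.0:
            return float("inf")                  # data impossible under h
        bits += -math.log2(p)                    # bits to encode this observation
    return bits

def mdl_hypothesis(data):
    """Minimizing total description length is maximizing the posterior."""
    return min(priors, key=lambda h: description_length(h, data))

print(mdl_hypothesis(["lime", "lime", "lime"]))  # same pick as MAP: 'h5' here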
Maximum A Posteriori Approximation and Maximum Likelihood Approximation for Bayesian Learning.